Main Questions
Our main question:
How did opioid shipment rates differ between rural and urban regions over time around the US from 2006-2014?
####
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie, and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://opencasestudies.github.io/ocs-bp-opioid-rural-urban/ocs_pop.html. Opioids in the United States (Version v1.0.0).
avocado update url ####
In this case study we will be examining the number of opioid pills (specifically oxycodone and hydrocodone, as they are the top two abused opioids) shipped to pharmacies and paractitionaers at the county-level around the United States (US) from 2006 to 2014.
This data comes from the DEA Automated Reports and Consolidated Ordering System (ARCOS) and was released by the Washington Post after legal action by the owner of the Charleston Gazette-Mail in West Virginia and the Washington Post.
We will investigate how the number of shipped pills compares for rural and urban counties. This analysis will demonstrate how different regions of the country may have been more at risk for opioid addiction crises due to differing rates of opioid prescription (using the number of pills as a proxy for perscription rates). This will help inform students about how evidence-based intervention decisions are made in this area.
This case study is motivated by this article:
García, M. C. et al. Opioid Prescribing Rates in Nonmetropolitan and Metropolitan Counties Among Primary Care Providers Using an Electronic Health Record System — United States, 2014–2017. MMWR Morb. Mortal. Wkly. Rep. 68, 25–30 (2019). DOI: 10.15585/mmwr.mm6802a1
This article explores rates of opioid perscriptions in rural and urban communties in the United States using the Athenahealth electronic health record (EHR) system for 31,422 primary care providers from January 2014 to March 2017.
The main takeaways from this article were:
Among 70,237 fatal drug overdoses in 2017, prescription opioids were involved in 17,029 (24.2%).
The percentage of patients prescribed an opioid was higher in rural than in urban areas.
Higher opioid prescribing rates put patients at risk for addiction and overdose.
Indeed, this was confirmed by another article which surveyed heroin users.
Cicero, T. J., Ellis, M. S., Surratt, H. L. & Kurtz, S. P. The Changing Face of Heroin Use in the United States: A Retrospective Analysis of the Past 50 Years. JAMA Psychiatry 71, 821 (2014). DOI:10.1001/jamapsychiatry.2014.366
They found that:
Respondents who began using heroin in the 1960s were predominantly young men (82.8%; mean age, 16.5 years) whose first opioid of abuse was heroin (80%).
However, more recent users were older (mean age, 22.9 years) men and women living in less urban areas (75.2%) who were introduced to opioids through prescription drugs (75.0%).
Heroin use has changed from an inner-city, minority-centered problem to one that has a more widespread geographical distribution, involving primarily white men and women in their late 20s living outside of large urban areas.
Photo by James Yarema on Unsplash
Our main question:
How did opioid shipment rates differ between rural and urban regions over time around the US from 2006-2014?
In this case study, we will demonstrate how to obtain data from an Application Programming Interface (API), which is an interface that allows users to more easily interact with a database. We will also especially focus on using packages and functions from the Tidyverse, such as dplyr, tidyr. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R more legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data science skills:
httr and jasonlite)NA values (tidyr)dplyrformattableggplot2naniar)Statistical concepts and methods:
We will begin by loading the packages that we will need:
library(readxl)
library(tibble)
library(httr)
library(jsonlite)
library(stringr)
library(here)
library(readr)
library(magrittr)
library(dplyr)
library(tidyr)
library(naniar)
library(ggplot2)
library(ggpol) #creates geomjitter
library(ggiraph) # creates interactive plot for plots that are difficult to label because they have so many elements
library(ggpubr) #ggarange
# library(devtools)
# library(usethis)
# library(tidyverse)
# library(openintro)
# #library(arcos)
# library(ggiraph)
# library(ggpubr)
# library(ggfortify)
# library(ggpol)
# library(lme4)
library(formattable)Packages used in this case study:
| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| readr | to import the data as a csv file |
| tibble | to create tibbles (the tidyverse version of dataframes) |
| dplyr | to filter, subset, join, add rows to, and modify the data |
| stringr | to manipulate character strings within the data (collapsing strings together, replace values, and detect values) |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long, to drop rows with NA values, and to see the last few columns of a tibble |
| ggmap | to geocode the data (which means get the latitude and longitude values) |
| sf | to modify the geocoded data so that overlapping points did not overlap |
| lubridate | to work with the data-time data |
| DT | to create the interactive table |
| htmltools | to add a caption to our interactive table |
| ggplot2 | to create plots |
| ggforce | to create a plot zoom |
| forcats | to reorder factor for plot |
| waffle | to make waffle proportion plots |
| poliscidata | to get population values for the states |
| flexdashboard | to create the dashboard |
| shiny | to allow our dashboard to be interactive |
| leaflet | to implement the leaflet (a JavaScript library for maps) to create the map for our dashboard |
The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
What exactly are opioids?
According to the DEA and the Alta Mira addiction treatment center:
Opioids (also known as narcotics which comes from the Greek word for “stupor”), describes a class of drugs that contain opium (the poppy plant - Papaver somniferum), are derived from opium, or contain a semi-synthetic or synthetic substitute for opium.
Photo by Ingo Doerrie on Unsplash
Hoewver, technically, opioids are substances that bind to the opioid receptors in the body, which are involved in the sensation of pain and the experience of reward. There are actually opioids that naturally are made by the human body, the most well known being the endorphins.
Oppoid drugs tend to be addictive becuase they modulate the reward system. This is the part of the brain that reinforces behaviors (normally these are behaviours such as drinking water or eating food) by causing the experience of pleasure (through the release of a neurotransmitter called dopamine).
This same system can be activated by both opioids that naturally occur in the body, as well as opioid perscription drugs and other addictive drugs. Activation of this sytem by drug use leads to very high releases of Dopamine and the sensation of pleasure which ultimately leads to reinforced use of that drug.
In general, opioids medications and drugs have been found to dull the senses, releive pain, supress cough, reduce respiration and heart rate, induce constipation, and induce feelings of euphoria. They have a high potential for abuse and addiction.
Drugs within this class include (with perscription drug brand names are shown in parentheses):
Opium comes from the fluid (which is also called poppy tears) inside the seed capusules of the Papaver somniferum plant. This contains morphine, codeine, and thebaine. This is then dried.
Opium has been used by humans since 5000 BCE and it has been used across the world. See here for an interesting overview of the history.
Opium derived medications were historically used in United States to treat a variety of ailments besides pain including: cholera, dysentery, tubuerculosi, and mental illness.
Of note, they state that “from 1898 to 1910 heroin was marketed as a non-addictive morphine substitute and cough medicine for children”!
Here you can see a photo of a bottle of herion:
Opioids have continued to be used in the treatment of pain.
The opioid epidemic began in the late 1990s.
According to the US department of health and human services (HHS):
In the late 1990s, pharmaceutical companies reassured the medical community that patients would not become addicted to opioid pain relievers and healthcare providers began to prescribe them at greater rates.
Increased prescription of opioid medications led to widespread misuse of both prescription and non-prescription opioids before it became clear that these medications could indeed be highly addictive.
In 2017 the HHS declared a public health emergency
See here for a timeline of the epidemic in the US and here for more detials about the epidemic.
According to this article from the Morbidity and Mortality Weekly Report (MMWR) of the Centers for Disease Control and Prevention (CDC):
Drug overdose is the leading cause of unintentional injury-associated death in the United States.
You can see that moth recent overdose deaths were due to the use of synthetic opioids, where as previous high levels of overdoses (till about 2015) were attributable to natural and semi-synthetic opioids (which is what we will look at in this case study).
They state that:
From 1999–2018, almost 450,000 people died from an overdose involving any opioid, including prescription and illicit opioids.
Importantly rates appear to differ across states, according to this CDC report
According to the motivating report for our case study:
Perscription rates are now declining, however, perscription of opioids was found to be higher in rural areas rather than urban ares.
It is important to identify locations where people are particularly vulernable to target interventions for communities that need it the most.
There are some important considerations regarding this data analysis to keep in mind:
According to the [Washington Post data](https://www.washingtonpost.com/national/2019/07/18/how-download-use-dea-pain-pills-database/ about the DEA data:
“It’s important to remember that the number of pills in each county does not necessarily mean those pills went to people who live in that county. The data only shows us what pharmacies the pills are shipped to and nothing else.”
Furthermore, we will define counties as being rural or urban however there can be great variation within a county and we used land area values form only 2010 even though these can fluctuate. Therefore the way we categorized counties should be seen as an approximation.
Finally, overdose deaths are often due to the use of multiple substances. Simply because a county recieved more pills does not mean that people in that county would experience more drug overdoses. It is also important to remember that perscription opioids only account for a portion of the drug overdose deaths reported in this time period. However, according to this article 75% of heroin users surveyed were introduced to opioids through perscription drug use.
We will use data from two sources:
The US census for land area of counties to allow us to extimate county-level population density
The Washington Post datafrom the Drug Enforcement Administration (DEA) about opioid (oxycodone and hydrocodone) pill shipments to pharmacies and paractitionaers around the US at the county-level
This dataset was released in July of 2019 and has been controversial as according to the Washington Post:
The disclosure is part of a civil action brought by 2,500 cities, towns, counties and tribal nations alleging that nearly two dozen drug companies conspired to saturate the nation with opioids.
See here for more details about how this database was released.
The Washington Poststates that they:
.. cleaned the data to include only information on shipments of oxycodone and hydrocodone pills. We did not include data on 10 other opioids because they were shipped in much lower quantities…
It’s important to remember that the number of pills in each county does not necessarily mean those pills went to people who live in that county. The data only shows us what pharmacies the pills are shipped to and nothing else.
This data was part of the [Automated Reports and Consolidated Ordering System (ARCOS)]https://www.deadiversion.usdoj.gov/arcos/retail_drug_summary/ of the DEA in which:
manufacturers and distributors report their controlled substances transactions
Their website indicates that:
The Controlled Substances Act of 1970 created the requirement for Manufacturers and Distributors to report their controlled substances transactions to the Attorney General. The Attorney General delegates this authority to the Drug Enforcement Administration (DEA).
ARCOS is an automated, comprehensive drug reporting system which monitors the flow of DEA controlled substances from their point of manufacture through commercial distribution channels to point of sale or distribution at the dispensing/retail level - hospitals, retail pharmacies, practitioners, mid-level practitioners, and teaching institutions. Included in the list of controlled substance transactions tracked by ARCOS are the following: All Schedules I and II materials (manufacturers and distributors); Schedule III narcotic and gamma-hydroxybutyric acid (GHB) materials (manufacturers and distributors); and selected Schedule III and IV psychotropic drugs (manufacturers only).
The annual report about the data from 2019, can be found here.
As this is a very large dataset, thus the Washington Post created an application prgoramming interface (API) to make it easier for users to access the data.
An API is a computational interface that simplifies interactacts with a data or file system for a user. It is similar to a Graphical User Interface GUI, yet it allows the user some more flexibility/functionality.
This link takes you to the Washington Post ARCOS API.
There was also an R package on cran called arcos for interacting with the API, but this has been archived. This package is however still available here on Github.
See here for more information about how to get access the Washington Post DEA database.
We will need land area data for our calculations of population density.
We obtained county land area data from the US census Bureau at this link
This link explains how the data is formated.
We will use the read_excel() function of the readxl package to import the data. We will also convert the data into a tibble (which is a the tidyverse version of a data frame) by using the as_tibble() function of the tibble package.
We can take a look at the data using the base head() function which will show the frist 6 rows.
# A tibble: 6 x 34
Areaname STCOU LND010190F LND010190D LND010190N1 LND010190N2 LND010200F
<chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 UNITED … 00000 0 3787425. 0000 0000 0
2 ALABAMA 01000 0 52423. 0000 0000 0
3 Autauga… 01001 0 604. 0000 0000 0
4 Baldwin… 01003 0 2027. 0000 0000 0
5 Barbour… 01005 0 905. 0000 0000 0
6 Bibb, AL 01007 0 626. 0000 0000 0
# … with 27 more variables: LND010200D <dbl>, LND010200N1 <chr>,
# LND010200N2 <chr>, LND110180F <dbl>, LND110180D <dbl>, LND110180N1 <chr>,
# LND110180N2 <chr>, LND110190F <dbl>, LND110190D <dbl>, LND110190N1 <chr>,
# LND110190N2 <chr>, LND110200F <dbl>, LND110200D <dbl>, LND110200N1 <chr>,
# LND110200N2 <chr>, LND110210F <dbl>, LND110210D <dbl>, LND110210N1 <chr>,
# LND110210N2 <chr>, LND210190F <dbl>, LND210190D <dbl>, LND210190N1 <chr>,
# LND210190N2 <chr>, LND210200F <dbl>, LND210200D <dbl>, LND210200N1 <chr>,
# LND210200N2 <chr>
Looks good!
The httr package formats what are called “GET requests” so that they will work properly. This allows for the data to be retrieved from the API.
The jsonlite package alows you to convert the data from JSON (often used by APIs) to a differet format that is easier to work with.
APIs typically require a password or key to gain access. Thus the httr package helps to authenticate your data request. Often these keys are something that you do not want to share, unless the API is public.
In our case the API is indeed public, and currently “uO4EK6I” is publicly published as a key to use on the github page for the arcos package. We will use that here to access the API.
We are interested in the county level data - first let’s get the population data. We can access it by:
GET button on the API.This gives us the following output:
curl -X GET "https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I" -H "accept: application/json"
We can copy the URL section "https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I" and use it in the GET() function of the httr package :
If we needed to specify a username and password, we would do so using the authenticate() function of the httr package within the GET function. The authenticate() function takes user, password and type arguments.
Here is an example:
The default type is "basic" and typcally what is needed.
Now that we have used the GET function, we have a JavaScript Object Notation (JSON) file of the data.
JSON files are lightweight meaning that they don’t take up much memory and they are human readible files to make transmitting data from websites easier.
Response [https://arcos-api.ext.nile.works/v1/county_population?key=uO4EK6I]
Date: 2020-09-18 20:19
Status: 200
Content-Type: application/json
Size: 5.75 MB
Here we can see that the object called countyjson is a json object. You will also see that the Satus is 200, which means that we were sucessful in retreiving the data from the API.
Now we can use the content() funtion of the httr package to extract the text from the file:
This will be a very large string at this point, we can take a look at part of it by using the str_sub() function of the stringr package. In this case we will only look at the first 400 characters.
What is a string or a chracter?
Click here for an explanation about character strings if you are not yet familiar
There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3".
If data are of class character, than the numeric values will not be processed like a numeric value in a mathematical sense.
[1] "[{\"BUYER_COUNTY\":\"AUTAUGA\",\"BUYER_STATE\":\"AL\",\"countyfips\":\"01001\",\"STATE\":1,\"COUNTY\":1,\"county_name\":\"Autauga\",\"NAME\":\"Autauga County, Alabama\",\"variable\":\"B01003_001\",\"year\":2006,\"population\":51328},{\"BUYER_COUNTY\":\"BALDWIN\",\"BUYER_STATE\":\"AL\",\"countyfips\":\"01003\",\"STATE\":1,\"COUNTY\":3,\"county_name\":\"Baldwin\",\"NAME\":\"Baldwin County, Alabama\",\"variable\":\"B01003_001\",\"year\":2006,\"population\":168121"
Now to get the data into a more readible format , we can use the fromJSON() function of the jsonlite package and again create a tibble of the data using as_tibble()
county_pop <- jsonlite::fromJSON(county_pop_text, flatten = TRUE)
county_pop <- as_tibble(county_pop)We can use the glimpse() function and the distinct() function of the dplyr package to get a better sense of the data. The distinct() function allows us to take a look at the unique values of the year variable.
Rows: 28,265
Columns: 10
$ BUYER_COUNTY <chr> "AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOUNT", "BULL…
$ BUYER_STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
$ countyfips <chr> "01001", "01003", "01005", "01007", "01009", "01011", "0…
$ STATE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ COUNTY <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
$ county_name <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bull…
$ NAME <chr> "Autauga County, Alabama", "Baldwin County, Alabama", "B…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ year <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 20…
$ population <int> 51328, 168121, 27861, 22099, 55485, 10776, 20815, 115388…
# A tibble: 9 x 1
year
<int>
1 2006
2 2007
3 2008
4 2009
5 2010
6 2011
7 2012
8 2013
9 2014
It looks like we have the full data from 2006-2014.
We are also interested in opioid pill shipment data at the county level.
Here is the result of the same steps using the API for the combined_county_annual data:
curl -X GET "https://arcos-api.ext.nile.works/v1/combined_county_annual?key=uO4EK6I" -H "accept: application/json"
Question Opportunity
See if you can fix import the data without looking at the code for the population data.
Click here to reveal the code.
county_annual_json<-httr::GET(url = "https://arcos-api.ext.nile.works/v1/combined_county_annual?key=uO4EK6I")
county_annual_json_text <-httr::content(county_annual_json, "text")
county_annual <- jsonlite::fromJSON(county_annual_json_text, flatten = TRUE)
annualDosage <- tibble::as_tibble(county_annual)Now let’s take a look at the data:
Rows: 27,758
Columns: 6
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "L…
$ year <int> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 20…
$ count <int> 877, 908, 871, 930, 1197, 1327, 1509, 1572, 1558, 5802, …
$ DOSAGE_UNIT <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, …
$ countyfips <chr> "45001", "45001", "45001", "45001", "45001", "45001", "4…
# A tibble: 9 x 1
year
<int>
1 2006
2 2007
3 2008
4 2009
5 2010
6 2011
7 2012
8 2013
9 2014
Looks like we have the same years of data.
Now let’s take a deaper look at the data to see if we have any missing data using the naniar package.
We can use the vis_miss() function to create a plot of missing data.
Let’s start with the land area data.
Looks like there is no missing data.
How about the population data:
We appear to be missing some values for the Name and variable data, but we don`t intend to use these, so this should be ok. It is however a good idea to check these rows to see if anything strange is happening.
Let’s use the filter() function of the dplyr package and the is.na() base function to see more about the data that does not have countyfips codes.
We will also start using the %>% pipe of the magrittr package for our assignments.
Click here if you are unfamiliar with piping in R, which uses this
%>% operator
By piping we mean using the %>% pipe operator which is accessible after loading the tidyverse or several of the packages within the tidyverse like dplyr because they load the magrittr package. This allows us to perform multiple sequential steps on one data input.
# A tibble: 15 x 10
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME variable
<chr> <chr> <chr> <int> <int> <chr> <chr> <chr>
1 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
2 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
3 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
4 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
5 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
6 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
7 PRINCE OF W… AK 02198 2 201 Prince of … <NA> <NA>
8 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
9 WRANGELL AK 02275 2 280 Wrangell <NA> <NA>
10 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
11 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
12 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
13 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
14 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
15 SKAGWAY HOO… AK 02232 2 232 Skagway Ho… <NA> <NA>
# … with 2 more variables: year <int>, population <int>
This looks ok. So let’s now move on to the DEA data.
Interesting, we appear to be missing countyfips codes for a small percentage of our annual data.
Let’s take a look at this data:
# A tibble: 760 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 ADJUNTAS PR 2006 147 102800 <NA>
2 ADJUNTAS PR 2007 153 104800 <NA>
3 ADJUNTAS PR 2008 153 45400 <NA>
4 ADJUNTAS PR 2009 184 54200 <NA>
5 ADJUNTAS PR 2010 190 56200 <NA>
6 ADJUNTAS PR 2011 186 65530 <NA>
7 ADJUNTAS PR 2012 138 57330 <NA>
8 ADJUNTAS PR 2013 138 65820 <NA>
9 ADJUNTAS PR 2014 90 59490 <NA>
10 AGUADA PR 2006 160 49200 <NA>
# … with 750 more rows
It looks like the missing data is data for Puerto Rico - it makes sense that it doesn’t have countyfips codes.
Let’s see if there is any data other than data for Puerto Rico that is also missing countyfips values. We can use the != operator which indicates not equal to.
# A tibble: 74 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 GUAM GU 2006 319 265348 <NA>
2 GUAM GU 2007 330 275600 <NA>
3 GUAM GU 2008 313 286900 <NA>
4 GUAM GU 2009 390 355300 <NA>
5 GUAM GU 2010 510 413800 <NA>
6 GUAM GU 2011 559 475600 <NA>
7 GUAM GU 2012 616 564800 <NA>
8 GUAM GU 2013 728 623200 <NA>
9 GUAM GU 2014 712 558960 <NA>
10 MONTGOMERY AR 2006 469 175390 <NA>
11 MONTGOMERY AR 2007 597 241270 <NA>
12 MONTGOMERY AR 2008 561 251760 <NA>
13 MONTGOMERY AR 2009 554 244160 <NA>
14 MONTGOMERY AR 2010 449 247990 <NA>
15 MONTGOMERY AR 2011 560 313800 <NA>
16 MONTGOMERY AR 2012 696 339520 <NA>
17 MONTGOMERY AR 2013 703 382300 <NA>
18 MONTGOMERY AR 2014 491 396900 <NA>
19 NORTHERN MARIANA ISLANDS MP 2006 165 117850 <NA>
20 NORTHERN MARIANA ISLANDS MP 2007 157 117500 <NA>
21 NORTHERN MARIANA ISLANDS MP 2008 204 143000 <NA>
22 NORTHERN MARIANA ISLANDS MP 2009 269 186900 <NA>
23 NORTHERN MARIANA ISLANDS MP 2010 231 196360 <NA>
24 NORTHERN MARIANA ISLANDS MP 2011 264 208500 <NA>
25 NORTHERN MARIANA ISLANDS MP 2012 290 217400 <NA>
26 NORTHERN MARIANA ISLANDS MP 2013 258 231400 <NA>
27 NORTHERN MARIANA ISLANDS MP 2014 260 239200 <NA>
28 PALAU PW 2006 5 14000 <NA>
29 PALAU PW 2007 9 26600 <NA>
30 PALAU PW 2008 2 7500 <NA>
31 PALAU PW 2009 3 10000 <NA>
32 PALAU PW 2013 1 1000 <NA>
33 SAINT CROIX VI 2006 544 198800 <NA>
34 SAINT CROIX VI 2007 612 237120 <NA>
35 SAINT CROIX VI 2008 694 254020 <NA>
36 SAINT CROIX VI 2009 601 233860 <NA>
37 SAINT CROIX VI 2010 764 316260 <NA>
38 SAINT CROIX VI 2011 756 320850 <NA>
39 SAINT CROIX VI 2012 755 314690 <NA>
40 SAINT CROIX VI 2013 802 328410 <NA>
41 SAINT CROIX VI 2014 684 269040 <NA>
42 SAINT JOHN VI 2006 65 22200 <NA>
43 SAINT JOHN VI 2007 60 21800 <NA>
44 SAINT JOHN VI 2008 70 24700 <NA>
45 SAINT JOHN VI 2009 58 23100 <NA>
46 SAINT JOHN VI 2010 75 23500 <NA>
47 SAINT JOHN VI 2011 89 30200 <NA>
48 SAINT JOHN VI 2012 85 30200 <NA>
49 SAINT JOHN VI 2013 66 22000 <NA>
50 SAINT JOHN VI 2014 63 20400 <NA>
51 SAINT THOMAS VI 2006 628 219100 <NA>
52 SAINT THOMAS VI 2007 757 249480 <NA>
53 SAINT THOMAS VI 2008 815 294250 <NA>
54 SAINT THOMAS VI 2009 798 313200 <NA>
55 SAINT THOMAS VI 2010 802 318630 <NA>
56 SAINT THOMAS VI 2011 932 383350 <NA>
57 SAINT THOMAS VI 2012 939 373280 <NA>
58 SAINT THOMAS VI 2013 988 376400 <NA>
59 SAINT THOMAS VI 2014 1021 314440 <NA>
60 <NA> AE 2006 2 330 <NA>
61 <NA> CA 2006 47 12600 <NA>
62 <NA> CT 2006 305 78700 <NA>
63 <NA> CT 2007 112 30900 <NA>
64 <NA> CT 2008 48 15000 <NA>
65 <NA> FL 2006 9 900 <NA>
66 <NA> FL 2007 7 700 <NA>
67 <NA> GA 2006 114 51700 <NA>
68 <NA> IA 2006 7 2300 <NA>
69 <NA> IN 2006 292 39300 <NA>
70 <NA> MA 2006 247 114900 <NA>
71 <NA> NV 2006 380 173600 <NA>
72 <NA> NV 2007 447 200600 <NA>
73 <NA> NV 2008 5 2200 <NA>
74 <NA> OH 2006 23 5100 <NA>
It looks like there is also data for other territories in the dataset, as well as some counties with no name.
For some reason the rows for the Montgomery county of Arkansa are also missing a countyfips value.
# A tibble: 9 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 MONTGOMERY AR 2006 469 175390 <NA>
2 MONTGOMERY AR 2007 597 241270 <NA>
3 MONTGOMERY AR 2008 561 251760 <NA>
4 MONTGOMERY AR 2009 554 244160 <NA>
5 MONTGOMERY AR 2010 449 247990 <NA>
6 MONTGOMERY AR 2011 560 313800 <NA>
7 MONTGOMERY AR 2012 696 339520 <NA>
8 MONTGOMERY AR 2013 703 382300 <NA>
9 MONTGOMERY AR 2014 491 396900 <NA>
According to this website thie fIPS code is 05097.
We will update these values in the next section.
We want the LND110210D column which is the data from the year 2010.
LND = Land Area 110 = unit square miles (subgroup-code of the group) * avocado I found this somehwere else.. the census info was vauge would like to confirm that that is indeed what the sugroup code shows us 2 = century 10 = 2010 (based on the century) D = Data
Thus we can select just the county names, the county numeric codes, and the LND110210Dcolumn by using the select() function of the dplyr package.
# A tibble: 3,198 x 3
Areaname STCOU LND110210D
<chr> <chr> <dbl>
1 UNITED STATES 00000 3531905.
2 ALABAMA 01000 50645.
3 Autauga, AL 01001 594.
4 Baldwin, AL 01003 1590.
5 Barbour, AL 01005 885.
6 Bibb, AL 01007 623.
7 Blount, AL 01009 645.
8 Bullock, AL 01011 623.
9 Butler, AL 01013 777.
10 Calhoun, AL 01015 606.
# … with 3,188 more rows
countyfipsWe will use the case_when() function of the dplyr package recode the NA values for the rows for the MONGOMERY county of AR to be 05097. First we need to specify for these particular rows. Becuase there Montgomery may be a county name in other states, we need to evaluate when the BUYER_STATE is AR and when the BUYER_COUNTY is MONTGOMERY. We will use the & opperator to indcate that both conditions must be true. We will then recode the coutryfips values for these rows to be "05097" using the ~ symbol. All other values need to stay the same. Thus we need to use TRUE ~ to recode all the other countyfips values to what they currently are. Otherwise these would autmatically be NA.
We are also going to use a special pipe operator from the magrittr package called the compound assignment pipe-operator or sometimes the double pipe operator.
This allows us to use the annualDosage as our input and reassign it at the end after all the subsequent steps have been performed, although in this case it is only one step.
annualDosage %<>%
mutate(countyfips = case_when(BUYER_STATE == "AR" &
BUYER_COUNTY == "MONTGOMERY" ~ as.character("05097"),
TRUE ~ countyfips))Now we can check that we indeed fixed our data.
# A tibble: 0 x 6
# … with 6 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, year <int>,
# count <int>, DOSAGE_UNIT <dbl>, countyfips <chr>
# A tibble: 9 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 MONTGOMERY AR 2006 469 175390 05097
2 MONTGOMERY AR 2007 597 241270 05097
3 MONTGOMERY AR 2008 561 251760 05097
4 MONTGOMERY AR 2009 554 244160 05097
5 MONTGOMERY AR 2010 449 247990 05097
6 MONTGOMERY AR 2011 560 313800 05097
7 MONTGOMERY AR 2012 696 339520 05097
8 MONTGOMERY AR 2013 703 382300 05097
9 MONTGOMERY AR 2014 491 396900 05097
Great! We fixed it.
OK, we also had some rows that didn’t have county names because they were just missing or the data was for US territories. We will remove the values that dont have county names.
First let’s take a look at them agian.
# A tibble: 17 x 6
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
<chr> <chr> <int> <int> <dbl> <chr>
1 <NA> AE 2006 2 330 <NA>
2 <NA> CA 2006 47 12600 <NA>
3 <NA> CT 2006 305 78700 <NA>
4 <NA> CT 2007 112 30900 <NA>
5 <NA> CT 2008 48 15000 <NA>
6 <NA> FL 2006 9 900 <NA>
7 <NA> FL 2007 7 700 <NA>
8 <NA> GA 2006 114 51700 <NA>
9 <NA> IA 2006 7 2300 <NA>
10 <NA> IN 2006 292 39300 <NA>
11 <NA> MA 2006 247 114900 <NA>
12 <NA> NV 2006 380 173600 <NA>
13 <NA> NV 2007 447 200600 <NA>
14 <NA> NV 2008 5 2200 <NA>
15 <NA> OH 2006 23 5100 <NA>
16 <NA> PR 2006 10 17800 <NA>
17 <NA> PR 2007 2 1300 <NA>
We can filter out these values by using the ! exclamation mark before the is.na() function.
And now let’s check that these NA values are gone:
# A tibble: 0 x 6
# … with 6 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, year <int>,
# count <int>, DOSAGE_UNIT <dbl>, countyfips <chr>
Let’s check if our land area data has information for US territories. If not, we will remove the data for the territories in our annualDosage data. However, this would be very interesting and imporatant to investigate. We can use the str_detect() function of the stringr package, which contains lots of functions for looking for patterns in character strings, to look for data from Puerto Rico.
The str_detect() function allows us to look for a particular pattern. It does not have to be the full value, there can be a partial match. Thus we can look to see if there are any PR strings withing the vlaues of the the Areaname variable.
# A tibble: 0 x 3
# … with 3 variables: Areaname <chr>, STCOU <chr>, LND110210D <dbl>
You can see using a different abbreviation, that this code does as intended:
# A tibble: 81 x 3
Areaname STCOU LND110210D
<chr> <chr> <dbl>
1 ARIZONA 04000 113594.
2 ARKANSAS 05000 52035.
3 Arkansas, AR 05001 989.
4 Ashley, AR 05003 925.
5 Baxter, AR 05005 554.
6 Benton, AR 05007 847.
7 Boone, AR 05009 590.
8 Bradley, AR 05011 649.
9 Calhoun, AR 05013 629.
10 Carroll, AR 05015 630.
# … with 71 more rows
OK, so it does mot look like there is any territory land area data in this dataset. Thus we will also remove these from the annualDosage and monthlyDosage tibbles.
Question Opportunity
Do you recall how to do this?
Click here to reveal the code.
Great! Now there is no missing data in our annual data.
Defining if a region is rural or urban is actually quite complicated as the overall population changes, the structure of our towns and cities changes, and the access between different locations changes over time. Please see this report form the US Census Beureau about the history of this definition.
According to several definitions - urban areas are often defined as those with greater than 50,000 people. However, there are also definitions of rural areas being based on “population densities of less than 500 people per square mile and places with fewer than 2,500 people”. Typically counties are made up of multiple areas.
The census estimates rural and urban areas around the US relatively often. However, census collections about these measuresments does not occur every year.
Thus we will define a county as rural or urban based on the population density using the USDA definition that we described above:
rural = population densities of less than 500 people per square mile, as well as places with fewer than 2,500 people
uban = populations densities of greater than 500 people per square mile
Ideally we would want land area from each year, as these do fluctuate a bit, however, this should be a decent approximation as 2010 is in the middle of our time span.
We will therefore calculate the density as the number of people per square mile by dividing the population values by the land area values. To do so we first need to combine our county_area and our county_pop data together. First we want to make sure that we have one column, in our case the column that contains the numeric code for the counties, in the same format and with the same name in both the tibbles that we wish to combine.
We can use the rename() function of the dplyr package to rename the STCOU column to be countyfips. The new name is always listed first before the old name with this function like so: rename(new_name = old_name).
We can use the mutate() funtion of the dplyr package to make the countyfips variable a factor in both tibbles.
What exactly is a factor?
Click here for an explanation of data classes in R
There are several classes of data in R programming. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3".
If data are of class character, than the numeric values will not be processed like a numeric value in a mathematical sense.
If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).
A variable that is a factor has a set of particular values called levels. Even if these are numeric, they will be interpreted as level not as a mathematical numnber. You can modify the order of these levels with the forcats package.
county_pop %<>%
mutate(countyfips = as.factor(countyfips))
county_area %<>%
mutate(countyfips = as.factor(countyfips))Great! Now we are ready to combine our data together.
We can do so using one of the *_join()functions of the dplyr package.
There are several ways to join data using the dplyr package.
See here for more details about joining data.
Since the population data came from the API, we probably have information about opioid pill shipments for each of the included counties. Since the land area data came from a different source, it may contain additional counties that are not in our population or drug shipment data. Thus we will use the left_join() function where x in this case will be the county_pop and y will be the country_area. Thus we will add the LND110210D (land area) values for all counties that match county_pop based on the countyfips column that they have in common.
We are now ready to calculate the population density per square mile. We can create a new column with this data using the mutate() function and the / to divide the population value by the land area value (in square miles) for each county. Let’s also make the year variable a factor.
county_info %<>%
mutate(density = population/LND110210D,
year = as.factor(year))
glimpse(county_info)Rows: 28,265
Columns: 13
$ BUYER_COUNTY <chr> "AUTAUGA", "BALDWIN", "BARBOUR", "BIBB", "BLOUNT", "BULL…
$ BUYER_STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
$ countyfips <fct> 01001, 01003, 01005, 01007, 01009, 01011, 01013, 01015, …
$ STATE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ COUNTY <int> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 3…
$ county_name <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount", "Bull…
$ NAME <chr> "Autauga County, Alabama", "Baldwin County, Alabama", "B…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ year <fct> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 20…
$ population <int> 51328, 168121, 27861, 22099, 55485, 10776, 20815, 115388…
$ Areaname <chr> "Autauga, AL", "Baldwin, AL", "Barbour, AL", "Bibb, AL",…
$ LND110210D <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 776.83,…
$ density <dbl> 86.34681, 105.75111, 31.48563, 35.49584, 86.05261, 17.30…
Great, now we are ready to create a variable that classifies if a county was rural or urban based on our definition of rural counties being those with less than 500 people per square mile as well as those with less than 2,500 people. We will use the case_when() function of the dplyr package to calssify the new rural_urban variable as either "Urban" or "Rural" based on the evaluations of the density and the population variables. If the density is greater than or equal to 500 people per square mile, then the county will be coded as "Urban", alternatively if the density is less than 500 people per square mile or the population is less than 2500, than the county will be coded as "Rural". The | opperator is used to indicate that either expression should result in coding the county as "Rural"
county_info %<>%
mutate(rural_urban = case_when(density >= 500 ~ "Urban",
density < 500 | population < 2500 ~ "Rural"))We can use the count() function of the dplyr package to see how many of each this resulted in:
# A tibble: 2 x 2
rural_urban n
<chr> <int>
1 Rural 26065
2 Urban 2200
We will now combine the annualDosage data with the count_info tibble.
Question Opportunity
How might we do this?
Click here to reveal the code.
Rows: 27,007
Columns: 16
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "L…
$ year <fct> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 20…
$ count <int> 877, 908, 871, 930, 1197, 1327, 1509, 1572, 1558, 5802, …
$ DOSAGE_UNIT <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, …
$ countyfips <fct> 45001, 45001, 45001, 45001, 45001, 45001, 45001, 45001, …
$ STATE <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 22, 22, 22, 22, 22, …
$ COUNTY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county_name <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "Abb…
$ NAME <chr> "Abbeville County, South Carolina", "Abbeville County, S…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ population <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, …
$ Areaname <chr> "Abbeville, SC", "Abbeville, SC", "Abbeville, SC", "Abbe…
$ LND110210D <dbl> 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, …
$ density <dbl> 52.64435, 52.48940, 52.39561, 51.67795, 52.28144, 52.020…
$ rural_urban <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
Great, now we should have the data that we need for the case study.
Notice how there is a variable called DOSAGE_UNIT. This indicates the number of pills shipped to a pharmacy in this county that were either oxycodone or hydrocodone.
Let’s do a check to see how complete our data is now that we have combined our country_info data with the monthlyDosage and annualDosage data. We will have NA values for any counties present in the DAE data but not in our land area data. We can use the vis_miss() function naniar package to create a plot that shows if we have any missing data.
# A tibble: 27 x 16
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips STATE COUNTY
<chr> <chr> <fct> <int> <dbl> <fct> <int> <int>
1 MONTGOMERY AR 2006 469 175390 05097 NA NA
2 MONTGOMERY AR 2007 597 241270 05097 NA NA
3 MONTGOMERY AR 2008 561 251760 05097 NA NA
4 MONTGOMERY AR 2009 554 244160 05097 NA NA
5 MONTGOMERY AR 2010 449 247990 05097 NA NA
6 MONTGOMERY AR 2011 560 313800 05097 NA NA
7 MONTGOMERY AR 2012 696 339520 05097 NA NA
8 MONTGOMERY AR 2013 703 382300 05097 NA NA
9 MONTGOMERY AR 2014 491 396900 05097 NA NA
10 PRINCE OF W… AK 2006 190 62700 02201 NA NA
# … with 17 more rows, and 8 more variables: county_name <chr>, NAME <chr>,
# variable <chr>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
There does not appear to be land area and/or population data for these counties.
# A tibble: 9 x 14
BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME variable
<chr> <chr> <fct> <int> <int> <chr> <chr> <chr>
1 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
2 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
3 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
4 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
5 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
6 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
7 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
8 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
9 AUTAUGA AL 01001 1 1 Autauga Auta… B01003_…
# … with 6 more variables: year <fct>, population <int>, Areaname <chr>,
# LND110210D <dbl>, density <dbl>, rural_urban <chr>
# A tibble: 0 x 14
# … with 14 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <fct>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
# A tibble: 0 x 14
# … with 14 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <fct>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
# A tibble: 0 x 14
# … with 14 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <fct>, population <int>, Areaname <chr>, LND110210D <dbl>,
# density <dbl>, rural_urban <chr>
# A tibble: 1 x 34
Areaname STCOU LND010190F LND010190D LND010190N1 LND010190N2 LND010200F
<chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 Montgom… 05097 0 800. 0000 0000 0
# … with 27 more variables: LND010200D <dbl>, LND010200N1 <chr>,
# LND010200N2 <chr>, LND110180F <dbl>, LND110180D <dbl>, LND110180N1 <chr>,
# LND110180N2 <chr>, LND110190F <dbl>, LND110190D <dbl>, LND110190N1 <chr>,
# LND110190N2 <chr>, LND110200F <dbl>, LND110200D <dbl>, LND110200N1 <chr>,
# LND110200N2 <chr>, LND110210F <dbl>, LND110210D <dbl>, LND110210N1 <chr>,
# LND110210N2 <chr>, LND210190F <dbl>, LND210190D <dbl>, LND210190N1 <chr>,
# LND210190N2 <chr>, LND210200F <dbl>, LND210200D <dbl>, LND210200N1 <chr>,
# LND210200N2 <chr>
# A tibble: 0 x 34
# … with 34 variables: Areaname <chr>, STCOU <chr>, LND010190F <dbl>,
# LND010190D <dbl>, LND010190N1 <chr>, LND010190N2 <chr>, LND010200F <dbl>,
# LND010200D <dbl>, LND010200N1 <chr>, LND010200N2 <chr>, LND110180F <dbl>,
# LND110180D <dbl>, LND110180N1 <chr>, LND110180N2 <chr>, LND110190F <dbl>,
# LND110190D <dbl>, LND110190N1 <chr>, LND110190N2 <chr>, LND110200F <dbl>,
# LND110200D <dbl>, LND110200N1 <chr>, LND110200N2 <chr>, LND110210F <dbl>,
# LND110210D <dbl>, LND110210N1 <chr>, LND110210N2 <chr>, LND210190F <dbl>,
# LND210190D <dbl>, LND210190N1 <chr>, LND210190N2 <chr>, LND210200F <dbl>,
# LND210200D <dbl>, LND210200N1 <chr>, LND210200N2 <chr>
# A tibble: 0 x 34
# … with 34 variables: Areaname <chr>, STCOU <chr>, LND010190F <dbl>,
# LND010190D <dbl>, LND010190N1 <chr>, LND010190N2 <chr>, LND010200F <dbl>,
# LND010200D <dbl>, LND010200N1 <chr>, LND010200N2 <chr>, LND110180F <dbl>,
# LND110180D <dbl>, LND110180N1 <chr>, LND110180N2 <chr>, LND110190F <dbl>,
# LND110190D <dbl>, LND110190N1 <chr>, LND110190N2 <chr>, LND110200F <dbl>,
# LND110200D <dbl>, LND110200N1 <chr>, LND110200N2 <chr>, LND110210F <dbl>,
# LND110210D <dbl>, LND110210N1 <chr>, LND110210N2 <chr>, LND210190F <dbl>,
# LND210190D <dbl>, LND210190N1 <chr>, LND210190N2 <chr>, LND210200F <dbl>,
# LND210200D <dbl>, LND210200N1 <chr>, LND210200N2 <chr>
# A tibble: 0 x 10
# … with 10 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <int>, population <int>
# A tibble: 0 x 10
# … with 10 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <int>, population <int>
# A tibble: 0 x 10
# … with 10 variables: BUYER_COUNTY <chr>, BUYER_STATE <chr>, countyfips <fct>,
# STATE <int>, COUNTY <int>, county_name <chr>, NAME <chr>, variable <chr>,
# year <int>, population <int>
We will now remove these rows before further analysis:
Question Opportunity
Do you recall how you would do this?
Click here to reveal the code.
Nice! Now we have no missing data.
dens_df <- county_info %>% group_by(BUYER_STATE, year) %>%
summarise( mean_DENS = mean(density, na.rm = TRUE))
ggplot(dens_df, aes(x =BUYER_STATE, y = mean_DENS, col=year, group = year)) +
geom_point() +
theme_minimal()+
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90))We can see that the density is fairly similar for most states, however DC, MA, NJ, NY, RI, and VA have much higher densities. We also see that DC shows the largest change over time.
ggplot(dens_df, aes(x =year, y = log(mean_DENS), col=year, group = year)) +
geom_boxplot() +
geom_jitter(width = .1) +
theme_minimal() +
theme(axis.title.x=element_blank(),
legend.position = "none") +
labs(y = "Log state mean population density")dens_df <- county_info %>% group_by( year) %>%
summarise( mean_DENS = mean(density, na.rm = TRUE))
ggplot(dens_df, aes(x =year, y = mean_DENS)) +
geom_point() +
geom_smooth() +
theme_minimal()+
theme(axis.title.x=element_blank()) +
labs(y = "US mean population density")The density doesn’t appear to change that much from 2006 to 2014.
How does this compare with raw population values?
dens_df <- county_info %>% group_by( year) %>%
summarise( total_population = sum(population, na.rm = TRUE))
ggplot(dens_df, aes(x =year, y = total_population)) +
geom_point() +
geom_smooth() +
theme_minimal()+
theme(axis.title.x=element_blank()) +
labs(y = "US total population")R_U <- county_info %>% group_by( year) %>%
count(rural_urban)
ggplot(R_U, aes(x = year, y = n, col = rural_urban, group = rural_urban)) +
geom_point() +
geom_smooth() +
facet_wrap(~rural_urban, scales = "free") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90),
legend.title = element_blank())It looks as though the number of urban areas has increased, while the number of rural areas has decreased over time.
R_U <- county_info %>% group_by( year) %>%
count(rural_urban)
R_U %<>% pivot_wider( names_from = rural_urban, values_from = n)
R_U %<>% ungroup() %>%
mutate("Rural Change" = Rural - lag(Rural),
"Urban Change"= Urban - lag(Urban),
percent_urban = round(Urban/(Urban +Rural),3)*100)
formattable(R_U, list(`percent_urban` = color_bar("#FA614B")))| year | Rural | Urban | Rural Change | Urban Change | percent_urban |
|---|---|---|---|---|---|
| 2006 | 2903 | 239 | NA | NA | 7.6 |
| 2007 | 2900 | 242 | -3 | 3 | 7.7 |
| 2008 | 2900 | 242 | 0 | 0 | 7.7 |
| 2009 | 2900 | 240 | 0 | -2 | 7.6 |
| 2010 | 2898 | 242 | -2 | 2 | 7.7 |
| 2011 | 2893 | 247 | -5 | 5 | 7.9 |
| 2012 | 2893 | 247 | 0 | 0 | 7.9 |
| 2013 | 2890 | 250 | -3 | 3 | 8.0 |
| 2014 | 2888 | 251 | -2 | 1 | 8.0 |
formattable(R_U, list(`percent_urban` = color_bar("#FA614B"),
`Rural Change` = formatter("span",
style = ~style(
color = ifelse(`Rural Change` < 0, "red", "green"))),
`Urban Change` = formatter("span",
style = ~style(
color = ifelse(`Urban Change` < 0, "red", "green")))))| year | Rural | Urban | Rural Change | Urban Change | percent_urban |
|---|---|---|---|---|---|
| 2006 | 2903 | 239 | NA | NA | 7.6 |
| 2007 | 2900 | 242 | -3 | 3 | 7.7 |
| 2008 | 2900 | 242 | 0 | 0 | 7.7 |
| 2009 | 2900 | 240 | 0 | -2 | 7.6 |
| 2010 | 2898 | 242 | -2 | 2 | 7.7 |
| 2011 | 2893 | 247 | -5 | 5 | 7.9 |
| 2012 | 2893 | 247 | 0 | 0 | 7.9 |
| 2013 | 2890 | 250 | -3 | 3 | 8.0 |
| 2014 | 2888 | 251 | -2 | 1 | 8.0 |
R_U <- county_info %>% group_by(BUYER_STATE, year) %>%
count(rural_urban)
R_U %<>% pivot_wider( names_from = rural_urban, values_from = n)
R_U %<>%mutate(Urban = replace_na(Urban , 0))
R_U %<>% mutate(percent_urban = (Urban/(Urban+Rural))*100)
ggplot(R_U, color = "black", aes(x = BUYER_STATE, y = percent_urban, color = year)) +
geom_point() +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90),
legend.position = "bottom") +
coord_flip() +
scale_colour_viridis_d(option = "magma", end = 0, begin = 1, guide = guide_legend(nrow = 1))ggplot(Annual , aes(x = year, y = DOSAGE_UNIT)) +
geom_boxjitter()+
labs(title = "DOSAGE_UNIT Change Over Year")+
theme_minimal()Annual %>% group_by(BUYER_STATE,year) %>%
summarise( mean_DOSAGE = mean(DOSAGE_UNIT)) %>% ungroup() %>%
ggplot(aes(x = year, y = mean_DOSAGE)) +
geom_boxjitter()+
labs(title = "DOSAGE_UNIT Change Over Year")+
theme_minimal() Annual %>% group_by(BUYER_STATE,year) %>%
summarise( mean_DOSAGE = mean(DOSAGE_UNIT)) %>% ungroup() %>%
ggplot(aes(x = year, y = mean_DOSAGE, group = BUYER_STATE, color = BUYER_STATE)) +
geom_line( )g<-Annual %>% group_by(BUYER_STATE,year) %>%
summarise( mean_DOSAGE = mean(DOSAGE_UNIT)) %>% ungroup() %>%
ggplot(aes(x = year, y = mean_DOSAGE, group = BUYER_STATE, color = BUYER_STATE)) +
geom_line()
g <- g + geom_point_interactive(aes(
color = BUYER_STATE,
tooltip = usdata::abbr2state(BUYER_STATE)),
size = 2,
alpha = 3/10) +theme(legend.position = "nune")
girafe(code = print(g))In this plot it appearst that the counties in California got the largest number of pills shipped. However, since we did not account for population or population density, this could simply be because it is the most populated state. To account for this we will perform something called normalization to make a more fair comparison.
Rows: 26,980
Columns: 17
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "L…
$ year <fct> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 20…
$ count <int> 877, 908, 871, 930, 1197, 1327, 1509, 1572, 1558, 5802, …
$ DOSAGE_UNIT <dbl> 363620, 402940, 424590, 467230, 539280, 566560, 589010, …
$ countyfips <fct> 45001, 45001, 45001, 45001, 45001, 45001, 45001, 45001, …
$ STATE <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 22, 22, 22, 22, 22, …
$ COUNTY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county_name <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "Abb…
$ NAME <chr> "Abbeville County, South Carolina", "Abbeville County, S…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ population <int> 25821, 25745, 25699, 25347, 25643, 25515, 25387, 25233, …
$ Areaname <chr> "Abbeville, SC", "Abbeville, SC", "Abbeville, SC", "Abbe…
$ LND110210D <dbl> 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, …
$ density <dbl> 52.64435, 52.48940, 52.39561, 51.67795, 52.28144, 52.020…
$ rural_urban <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
$ dens_DOSAGE <dbl> 6907.104, 7676.598, 8103.541, 9041.187, 10314.942, 10891…
Why Density Normalization?
Description about how we also need log transformation
p1 <- ggplot(Annual , aes(x = year, y = DOSAGE_UNIT, colour = year)) +
geom_boxplot()+
labs(title = "Without Normalization")+
theme_minimal()
p2 <- ggplot(Annual, aes(x = year, y = dens_DOSAGE, colour = year)) +
geom_boxplot()+
labs(title = "Density Normalization")+
theme_minimal()
ggarrange(p1, p2, nrow = 2, ncol=2)Annual %>% pivot_longer(names_to = "type",
values_to = "value",
cols = c(DOSAGE_UNIT,
dens_DOSAGE)) %>%
mutate(type = forcats::fct_inorder(type)) %>%
glimpse()Rows: 53,960
Columns: 17
$ BUYER_COUNTY <chr> "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABBEVILLE", "ABB…
$ BUYER_STATE <chr> "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "SC", "S…
$ year <fct> 2006, 2006, 2007, 2007, 2008, 2008, 2009, 2009, 2010, 20…
$ count <int> 877, 877, 908, 908, 871, 871, 930, 930, 1197, 1197, 1327…
$ countyfips <fct> 45001, 45001, 45001, 45001, 45001, 45001, 45001, 45001, …
$ STATE <int> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, …
$ COUNTY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ county_name <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "Abb…
$ NAME <chr> "Abbeville County, South Carolina", "Abbeville County, S…
$ variable <chr> "B01003_001", "B01003_001", "B01003_001", "B01003_001", …
$ population <int> 25821, 25821, 25745, 25745, 25699, 25699, 25347, 25347, …
$ Areaname <chr> "Abbeville, SC", "Abbeville, SC", "Abbeville, SC", "Abbe…
$ LND110210D <dbl> 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, 490.48, …
$ density <dbl> 52.64435, 52.64435, 52.48940, 52.48940, 52.39561, 52.395…
$ rural_urban <chr> "Rural", "Rural", "Rural", "Rural", "Rural", "Rural", "R…
$ type <fct> DOSAGE_UNIT, dens_DOSAGE, DOSAGE_UNIT, dens_DOSAGE, DOSA…
$ value <dbl> 363620.000, 6907.104, 402940.000, 7676.598, 424590.000, …
Annual %>% pivot_longer(names_to = "type",
values_to = "value",
cols = c(DOSAGE_UNIT,
dens_DOSAGE)) %>%
mutate(type = forcats::fct_inorder(type)) %>%
ggplot( aes(x = year, y = value, colour = year)) +
geom_boxplot()+
facet_wrap(~type, scale = "free")+
theme_minimal() +
theme(axis.text.x = element_text(angle = 90),
legend.position = "none")Annual %>% pivot_longer(names_to = "type",
values_to = "value",
cols = c(DOSAGE_UNIT,
dens_DOSAGE)) %>%
mutate(type = forcats::fct_inorder(type)) %>%
ggplot( aes(x = year, y = log(value), colour = year)) +
geom_boxplot()+
facet_wrap(~type, scale = "free")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90),
legend.position = "none")The goal of normalization is to make every datapoint have the same scale so each feature is equally important.
Annual %>% pivot_longer(names_to = "type",
values_to = "value",
cols = c(DOSAGE_UNIT,
dens_DOSAGE ))%>%
mutate(type = forcats::fct_inorder(type)) %>%
ggplot( aes(y = value, x = year, colour = rural_urban, group = rural_urban)) +
stat_summary(fun.y = mean,
fun.ymin = function(x) mean(x) - sd(x),
fun.ymax = function(x) mean(x) + sd(x),
geom = "pointrange") +
stat_summary(fun.y = mean,
geom = "line") +
facet_wrap( ~ type, scales = "free") +
labs(title = "Dosage Change Without Normalization")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90))p5 <- ggplot(Annual, aes(y = DOSAGE_UNIT, x = year, colour = rural_urban, group = rural_urban)) +
stat_summary(fun.y = mean,
fun.ymin = function(x) mean(x) - sd(x),
fun.ymax = function(x) mean(x) + sd(x),
geom = "pointrange") +
stat_summary(fun.y = mean,
geom = "line") +
facet_wrap( ~ rural_urban) +
labs(title = "Dosage Change Without Normalization")+
theme_minimal()
p6 <- ggplot(Annual , aes(y = dens_DOSAGE, x = year, colour = rural_urban, group = rural_urban)) +
stat_summary(fun.y = mean,
fun.ymin = function(x) mean(x) - sd(x),
fun.ymax = function(x) mean(x) + sd(x),
geom = "pointrange") +
stat_summary(fun.y = mean,
geom = "line") +
facet_wrap( ~ rural_urban) +
labs(title = "Dosage Change - Density Normalization")+
theme_minimal()
ggarrange(p5, p6, ncol= 2 )http://www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r
Two Sample t-test
data: DOSAGE_UNIT by rural_urban
t = -99.117, df = 26978, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17804308 -17113799
sample estimates:
mean in group Rural mean in group Urban
2235774 19694828
Two Sample t-test
data: dens_DOSAGE by rural_urban
t = 14.652, df = 26978, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
19978.13 26148.72
sample estimates:
mean in group Rural mean in group Urban
39031.93 15968.51
We can see that the result is different depending on how we normalize the data!
In this case study we have demonstrated the basics of R Markdown and how to create a dashboard with using the flexdashboard package. We also demonstrated how to include an interactive table with the DT package, how to include interactive plots using functions of the shiny package such as renderPlot(). We also included interactive value boxes using the renderValueBox() function of the flexdashboard package which works with the shiny package. Finally we also showed how to include interactive maps using the leaflet package.
This case study also explored how to properly calculate and interpret percentages when the data has missing values. We also discussed the benefits and limiting aspects of pie charts (using ggplot2) and waffle plots (using waffle).
Overall the dashboard that we created shows that the number of shootings per year has increased overtime. Further investigation is necessary to determine if this is simply due to increases in population alone or if the rate has increased due to other factors and if so, what those factors might be. It is also clear that the number of shootings and the number of deaths per capita varies by state. Thus there appears to be other aspects accounting for state differences.
Create another dashboard with graphs and statistics featuring other elements within this dataset. For example, students may create graphs that explore what school events are reported to have more shootings.
RStudio
Cheatsheet on RStuido IDE
Other RStudio cheatsheets
RStudio projects
String manipulation cheatsheet
Table formats
Geocoding
Coordinate reference system (CRS) ESPG World Geodetic System (WGS) version 84 also called ESPG:4326
Albers equal-area conic projection
crs 102008
To learn more about geospatial coordinate systems see here and here.
ggplot2 package
Please see this case study for more details on using ggplot2
grammar of graphics
ggplot2 themes
Motivating article for this case study about school shootings
Also see this article to learn more about the impacts of school shootings.
Lightweight markup languages(LML)
Markdown
R markdown
knitr
rmarkdown (package)
See this book for more information on working with R Markdown files.
The RStudio cheatsheet for R Markdown and this tutorial are great for getting started.
See here for a video about flexdashboard and here for a more information on how to use this package.
See here for a list of other packages that are useful for adding elements to dashboards created with the flexdashboard package.
See here for a list of R Markdown themes which can be used with flexdashbard.
See Font Awesome for icons.
To learn more about using shiny with the flexdashboard package to create interactive dashboards, see this tutorial.
leaflet (R package)
Leaflet (JavaScript Library)
shiny
See here for a gallery of shiny examples.
See this website to learn about a more flexible and slightly more challenging option for creating dashboards in R using a package called shinydashboard.
Packages used in this case study:
| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| readr | to import the data as a csv file |
| googlesheets4 | to import directly from Google Sheets |
| tibble | to create tibbles (the tidyverse version of dataframes) |
| dplyr | to filter, subset, join, add rows to, and modify the data |
| stringr | to manipulate character strings within the data (collapsing strings together, replace values, and detect values) |
| magrittr | to pipe sequential commands |
| tidyr | to change the shape or format of tibbles to wide and long, to drop rows with NA values, and to see the last few columns of a tibble |
| ggmap | to geocode the data (which means get the latitude and longitude values) |
| sf | to modify the geocoded data so that overlapping points did not overlap |
| lubridate | to work with the data-time data |
| DT | to create the interactive table |
| htmltools | to add a caption to our interactive table |
| ggplot2 | to create plots |
| ggforce | to create a plot zoom |
| forcats | to reorder factor for plot |
| waffle | to make waffle proportion plots |
| poliscidata | to get population values for the states |
| flexdashboard | to create the dashboard |
| shiny | to allow our dashboard to be interactive |
| leaflet | to implement the leaflet (a JavaScript library for maps) to create the map for our dashboard |
If you or a loved one is stuggling with opioid addiction, contact the SAMHSA’s National Helpline at 1-800-662-HELP (4357).
It is a free, confidential, 24/7, 365-day-a-year treatment referral and information service (in English and Spanish) for individuals and families facing mental and/or substance use disorders.
You can also contact the Addiction Center at (877)871-3575 which also has a confidential 24/7 live chat at: https://www.addictioncenter.com/drugs/overdose/.
According to their website:
Remember, that being able to treat an overdose at home is not a replacement for a hospital. Even if the moment has passed, and the victim seems fine, there is still a chance that something is going on that cannot be seen by the human eye. Taking the victim to the hospital, can mean the difference between life and death.
Overdose is a scary word. We often associate it with death, but the two are not always connected. Life can go on after an overdose, but only if the person suffering understands and learns from it. Getting on the road to recovery is not easily done but it is always possible, and the only guaranteed way to never suffer an overdose again. If you don’t know where this path begins, or need help getting help for a loved one, please reach out to a dedicated treatment provider. They’re here, 24/7, to answer any questions you may have. Be it for yourself or someone else.
According to harmreduction.org, the following are signs of an overdose:
If someone is making unfamiliar sounds while “sleeping” it is worth trying to wake him or her up. Many loved ones of users think a person was snoring, when in fact the person was overdosing. These situations are a missed opportunity to intervene and save a life.
Sometimes it can be difficult to tell if a person is just very high, or experiencing an overdose. If you’re having a hard time telling the difference, it is best to treat the situation like an overdose – it could save someone’s life.
The most important thing is to act right away!
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gdtools_0.2.2 formattable_0.2.0.1 ggpubr_0.4.0
[4] ggiraph_0.7.8 ggpol_0.0.6 ggplot2_3.3.1
[7] naniar_0.5.1 tidyr_1.1.0 dplyr_1.0.0
[10] magrittr_1.5 readr_1.3.1 stringr_1.4.0
[13] jsonlite_1.7.1 httr_1.4.2 tibble_3.0.1
[16] readxl_1.3.1 knitr_1.28 here_0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 lattice_0.20-41 assertthat_0.2.1 rprojroot_1.3-2
[5] digest_0.6.25 utf8_1.1.4 R6_2.4.1 cellranger_1.1.0
[9] plyr_1.8.6 backports_1.1.7 visdat_0.5.3 evaluate_0.14
[13] pillar_1.4.4 rlang_0.4.6 curl_4.3 uuid_0.1-4
[17] data.table_1.12.8 car_3.0-8 Matrix_1.2-18 rmarkdown_2.2
[21] splines_4.0.1 labeling_0.3 foreign_0.8-80 htmlwidgets_1.5.1
[25] munsell_0.5.0 broom_0.5.6 compiler_4.0.1 xfun_0.14
[29] pkgconfig_2.0.3 systemfonts_0.2.2 base64enc_0.1-3 mgcv_1.8-31
[33] htmltools_0.4.0 tidyselect_1.1.0 rio_0.5.16 viridisLite_0.3.0
[37] fansi_0.4.1 crayon_1.3.4 withr_2.2.0 grid_4.0.1
[41] nlme_3.1-148 gtable_0.3.0 lifecycle_0.2.0 scales_1.1.1
[45] zip_2.0.4 cli_2.0.2 stringi_1.4.6 carData_3.0-4
[49] farver_2.0.3 ggsignif_0.6.0 ellipsis_0.3.1 generics_0.0.2
[53] vctrs_0.3.1 cowplot_1.0.0 openxlsx_4.1.5 tools_4.0.1
[57] forcats_0.5.0 glue_1.4.1 purrr_0.3.4 hms_0.5.3
[61] abind_1.4-5 yaml_2.2.1 usdata_0.1.0 colorspace_1.4-1
[65] rstatix_0.6.0 haven_2.3.1
We would like to acknowledge Elizabeth Stuart for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.